Run(dom) Forrest

Botti Matteo(10531192) - Rossi Lorenzo(10869172) - Radici Ester(10615545)


Part 1: Malware Infection Prediction

Loading library and dataset

Definition of some functions common to all different models

This series of functions was implemented to handle outliers.
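The helper functions themselves are not reproduced here. As a minimal sketch of one common way to handle outliers, the snippet below caps values outside the Tukey fences (1.5 × IQR). The function names `iqr_bounds` and `clip_outliers`, the factor `k`, and the toy RAM values are all assumptions for illustration, not the functions used in this notebook.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the lower and upper Tukey fences of a numeric column."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_outliers(series, k=1.5):
    """Cap values outside the fences instead of dropping the rows."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)


# Toy RAM values in MB (made up for illustration only).
ram = pd.Series(
    [2048, 4096, 4096, 8192, 8192, 16384, 1048576], dtype="float64"
)
clipped = clip_outliers(ram)
print(clipped.max())
```

Capping keeps every row, which may matter when the label is balanced and rows should not be discarded; dropping or log-transforming are alternative choices.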

A) Start by computing some basic statistics:

How many rows and how many columns are there in the data?

What are the names and datatypes in each column?
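The two questions above can be answered with `DataFrame.shape` and `DataFrame.dtypes`. The sketch below uses a tiny hand-made frame, since the real dataset is not shown here; the column names `HasDetections` and `DiskType` are assumptions.

```python
import pandas as pd

# Tiny stand-in for the real dataset (values made up for illustration).
df = pd.DataFrame(
    {
        "HasDetections": [0, 1, 1, 0],
        "ProcessorCount": [4, 8, 2, 4],
        "DiskType": ["SSD", "HDD", "SSD", "HDD"],
    }
)

n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# One line per column: name on the left, dtype on the right.
print(df.dtypes)
```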

What percentage of computers are infected?

What percentage of computers have touch screens enabled?

What percentage of computers have solid-state hard drives?

What percentage of computers are gaming machines?

What percentage of computers have a firewall enabled?
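Each of the percentage questions above reduces to the mean of a 0/1 flag. A minimal sketch, assuming hypothetical column names (`HasDetections`, `IsTouchEnabled`, `DiskType`) and toy values; the helper `percent_true` is likewise an illustrative name:

```python
import pandas as pd


def percent_true(series):
    """Percentage of non-missing rows where a 0/1 or boolean flag is set."""
    return float(series.dropna().astype(bool).mean() * 100)


# Toy data, made up for illustration only.
df = pd.DataFrame(
    {
        "HasDetections": [1, 0, 1, 0],
        "IsTouchEnabled": [0, 0, 1, 0],
        "DiskType": ["SSD", "HDD", "SSD", "SSD"],
    }
)

infected = percent_true(df["HasDetections"])
touch = percent_true(df["IsTouchEnabled"])
# A categorical column is first turned into a boolean flag.
ssd = percent_true(df["DiskType"] == "SSD")
print(infected, touch, ssd)
```

Missing values are dropped before averaging, so the percentage is relative to the rows where the flag is known.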

What is the max/min/mean/median processor count?

What is the max/min/mean/median RAM on the machines?

What is the max/min/mean/median display size in inches?
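The max/min/mean/median questions above can be answered for several columns at once with `DataFrame.agg`. The sketch below uses toy values; only the column names `ProcessorCount` and `TotalPhysicalRAM` come from this report, and the display-size column is omitted because its name is not given.

```python
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame(
    {
        "ProcessorCount": [2, 4, 4, 8],
        "TotalPhysicalRAM": [2048, 4096, 4096, 16384],
    }
)

# Rows of the result are the statistics, columns are the variables.
summary = df[["ProcessorCount", "TotalPhysicalRAM"]].agg(
    ["max", "min", "mean", "median"]
)
print(summary)
```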

How many countries and cities are there in the dataset?
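Counting distinct countries and cities is a matter of `Series.nunique`. A short sketch, assuming the identifiers live in columns named `CountryIdentifier` and `CityIdentifier` (hypothetical names) and using toy values:

```python
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame(
    {
        "CountryIdentifier": [1, 1, 2, 3],
        "CityIdentifier": [10, 11, 20, 30],
    }
)

# nunique() ignores missing values by default.
n_countries = df["CountryIdentifier"].nunique()
n_cities = df["CityIdentifier"].nunique()
print(n_countries, n_cities)
```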

**COMMENT**:

So far, we have seen that the dataset is balanced with respect to the label: 50% of the devices are infected, so we can proceed with the analysis without any oversampling or downsampling. We also noticed that some variables have maximum values far above the column mean, so we expect to encounter many outliers.

For instance:

  • in TotalPhysicalRAM, the mean (6114) is much higher than the median (3072), which indicates positive asymmetry (a right-skewed distribution)
  • in ProcessorCount the situation is reversed: the mean (4) is lower than the median (8), which indicates negative asymmetry (a left-skewed distribution)
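The mean-versus-median comparison can be checked numerically alongside the sample skewness. The sketch below uses two made-up series, not the actual columns, purely to show the relationship: a long right tail pulls the mean above the median and gives positive skewness, and the mirrored data does the opposite.

```python
import pandas as pd

# Toy series with a long right tail (made up for illustration).
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 20])
# Mirror image of the same data: long left tail.
left_tailed = 20 - right_tailed

for name, s in [("right_tailed", right_tailed), ("left_tailed", left_tailed)]:
    print(
        f"{name}: mean={s.mean():.2f}, "
        f"median={s.median():.2f}, skew={s.skew():.2f}"
    )
```

Comparing mean and median is only a quick heuristic; `Series.skew()` gives a direct estimate of the asymmetry.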

Figure: positive asymmetry vs. negative asymmetry